Getting data into R
2025-08-31
In this class you’ll learn:
how to create small data sets in R using vectors and data frames,
how to import data from external files into R,
how to best organise data in files.
If you have small data sets, it will often be quickest to just create the data by typing it into a script
This is especially so if the data will only be used within this script
If the data are larger, or will be used in several scripts, prepare the data in a file and import them into R
We have two Devon Rex cats, Hansi and Apricot, who we weigh regularly
Here are the last 10 observations for Hansi:
| Weight (kg) | Weight (kg) |
|---|---|
| 5.65 | 5.55 |
| 5.25 | 5.40 |
| 5.65 | 5.50 |
| 5.35 | 5.55 |
| 5.45 | 5.25 |
Hansi’s weight is barely enough to activate our human digital scales, so we weigh ourselves with and without holding him and take the difference. As a result, there’s a lot of noise in these weights.
The simplest way to get working with these data in R is just to enter them as a vector
We use the c() function to combine values into a vector
Tip
Remember to separate each value with a comma , and space out the values to make them easier to read
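As a minimal sketch of this step, the ten weights from the table above (read down each column) can be entered with c() and stored in an object. The object name hansi is simply a convenient choice.

```r
# Hansi's last 10 weights (kg), read down the columns of the table
hansi <- c(5.65, 5.25, 5.65, 5.35, 5.45, 5.55, 5.40, 5.50, 5.55, 5.25)

# Print the vector to check the values were typed correctly
hansi
```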
Now we can use the data like any other vector
If we wanted to know Hansi’s average (mean) weight of the most recent weighings, we could use mean()
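A short sketch of that calculation, assuming the weights are stored in a vector named hansi as described above:

```r
# Hansi's last 10 weights (kg)
hansi <- c(5.65, 5.25, 5.65, 5.35, 5.45, 5.55, 5.40, 5.50, 5.55, 5.25)

# Arithmetic mean of the 10 weights
mean(hansi)
```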
We see that Hansi’s average weight is 5.46 kg.
We will often be working with more than one variable
In this case we also have the observations of Apricot’s weight for her last 10 weighings
| Weight (kg) | Weight (kg) |
|---|---|
| 3.15 | 3.35 |
| 3.40 | 3.05 |
| 3.20 | 3.40 |
| 3.40 | 3.25 |
| 3.50 | 3.20 |
We can enter Apricot’s data just as we did for Hansi
and calculate her average weight
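A sketch of both steps for Apricot, using the values from her table (read down each column). The object name apricot is an assumption made for this illustration.

```r
# Apricot's last 10 weights (kg), read down the columns of the table
apricot <- c(3.15, 3.40, 3.20, 3.40, 3.50, 3.35, 3.05, 3.40, 3.25, 3.20)

# Her average (mean) weight
mean(apricot)
```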
The weights for Hansi and Apricot were recorded at the same times: the first weight for Hansi was recorded at the same time as the first weight for Apricot, and so on
Instead of working with separate vectors, we can store the vectors in a data frame
This is how we’ll typically encounter data throughout this course
There is a separate video all about data frames
But simply, we can think of a data frame as R’s equivalent of an Excel worksheet
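One way to do this, sketched with base R's data.frame(), is shown here. The names cats, hansi and apricot are choices made for this illustration, and each vector becomes one column.

```r
# The two vectors of weights (kg); position i in each is the same weighing
hansi   <- c(5.65, 5.25, 5.65, 5.35, 5.45, 5.55, 5.40, 5.50, 5.55, 5.25)
apricot <- c(3.15, 3.40, 3.20, 3.40, 3.50, 3.35, 3.05, 3.40, 3.25, 3.20)

# Combine them into a data frame: one row per weighing, one column per cat
cats <- data.frame(hansi = hansi, apricot = apricot)
cats

# A single column is extracted with $ and used like any other vector
mean(cats$hansi)
```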
Now that we have the data in a data frame, we can use these data in models or plots
If you have larger data sets, or data that will be used in several scripts, you should store the data in a file and load the data into R
You could store the data in many different ways using a plethora of software applications
The best way to store simple tabular data is in a plain text file — .csv
You can also store your data in the newer Excel workbook format — .xlsx
Older Excel files were binary formats which could not be read by humans
I would recommend using CSV files, but Excel is also acceptable, especially for your own use
Tip
Use Excel to create a file, but save it as CSV
CSV stands for comma separated values
In CSV files, the data are stored row-wise, with the values of the different variables separated by a comma
The first few rows of the full set of weights for Hansi and Apricot in CSV format are:
Hansi,Apricot
5.65,
5.25,3.15
5.65,3.4
5.35,3.2
5.45,3.4
Notice that the observation for Apricot is missing for the most recent weighing
Often, character strings will be quoted:
"Hansi","Apricot"
5.65,
5.25,3.15
5.65,3.4
5.35,3.2
5.45,3.4
Quoting allows fields (individual values) to contain a comma (,) themselves
One problem with this definition of a CSV file is that some countries use , for the decimal point
In such locales, the semi-colon ; is often used as the field separator
For example, in Denmark, Excel would create a CSV file that looked like this
"Hansi";"Apricot"
5,65;
5,25;3,15
5,65;3,4
5,35;3,2
5,45;3,4
We call the field separators, such as the comma (,) and the semi-colon (;), delimiters
Other file types make use of different delimiters, for example the tab character (\t) for Tab-delimited or TSV files
You need to be aware of how the file is delimited before you try to read it
It is a good idea to open the file in a text editor (not Word) to identify the delimiter used
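If a text editor is not to hand, a rough alternative is to look at the first few raw lines from within R. This is only a sketch: the temporary file and its contents are made up here so that the example runs on its own, and in practice you would point readLines() at your own file.

```r
# Write a small example file so the sketch is self-contained (illustration only)
csv_file <- tempfile(fileext = ".csv")
writeLines(c('"Hansi";"Apricot"', "5,65;", "5,25;3,15"), csv_file)

# Show the first few lines exactly as they are stored in the file,
# so the delimiter (here a semi-colon) can be seen by eye
first_lines <- readLines(csv_file, n = 3)
first_lines
```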
Base R comes with several functions for importing data from plain text files
However, we will use the readr 📦 from the tidyverse to import and export (read & write) plain text files
The package needs to be installed as it is not a standard R package
We load the package, whenever we want to use it, using
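A minimal sketch of both steps. Installation is only needed once per computer, so that line is left commented out here.

```r
# Install once per computer (uncomment to run)
# install.packages("readr")

# Load the package in every script or session that uses it
library(readr)
```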
CSV files are imported using read_csv() or read_csv2() — the latter is for files using ; as the delimiter and , as the decimal separator
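To illustrate the second function, here is a sketch that writes the Danish-style rows shown earlier to a temporary file and reads them back with read_csv2(). The temporary file and the object names danish_file and cats_dk are assumptions made so the example runs on its own.

```r
library(readr)

# Write the semi-colon delimited, decimal-comma rows to a temporary file
danish_file <- tempfile(fileext = ".csv")
writeLines(
  c('"Hansi";"Apricot"', "5,65;", "5,25;3,15", "5,65;3,4", "5,35;3,2", "5,45;3,4"),
  danish_file
)

# read_csv2() expects ; between fields and , as the decimal separator
cats_dk <- read_csv2(danish_file)
cats_dk
```

The empty field after the first semi-colon is read as a missing value (NA).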
To import a CSV file located in the current folder, we can just provide the name of the file as a character vector
If the file is located in a different folder, we need to provide the path to the file.
For example, if your working directory contains a folder named data and the data file is within that folder, we would use
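A sketch of such a call is given here. The file name cat-weights.csv is hypothetical, and the first few lines only set up a throwaway working directory with a data folder so that the example can run on its own.

```r
library(readr)

# Set-up for illustration only: a temporary working directory with a data folder
old_wd <- setwd(tempdir())
dir.create("data", showWarnings = FALSE)
writeLines(c("Hansi,Apricot", "5.65,", "5.25,3.15", "5.65,3.4"),
           "data/cat-weights.csv")

# Import the file by giving its path relative to the current folder
cats <- read_csv("./data/cat-weights.csv")
cats

# Restore the original working directory
setwd(old_wd)
```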
where
the "./" bit means the current folder,
the "data/" bit means go into the data folder in the current folder.
We import the weight data for Hansi and Apricot using
Rows: 31 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): Hansi, Apricot
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 31 × 2
Hansi Apricot
<dbl> <dbl>
1 5.65 NA
2 5.25 3.15
3 5.65 3.4
4 5.35 3.2
5 5.45 3.4
6 5.55 3.5
7 5.4 3.35
8 5.5 3.05
9 5.55 3.4
10 5.25 3.25
# ℹ 21 more rows
It is best practice to provide the expected variable type for each column in the data set
That way, if the data differs from your expectations, readr will complain loudly
We specify the variable types using the col_types argument
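A sketch of what this can look like, using readr's cols() and col_double() helpers. To keep the example self-contained it reads only the first few rows shown earlier from a temporary file, so the result is shorter than the full data set printed in these notes.

```r
library(readr)

# A small stand-in for the weights file (illustration only)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Hansi,Apricot", "5.65,", "5.25,3.15", "5.65,3.4"), csv_file)

# State that both columns are expected to hold numbers (doubles)
cats <- read_csv(
  csv_file,
  col_types = cols(Hansi = col_double(), Apricot = col_double())
)
cats
```

readr also accepts a compact string with one letter per column, so col_types = "dd" should be equivalent here.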
# A tibble: 31 × 2
Hansi Apricot
<dbl> <dbl>
1 5.65 NA
2 5.25 3.15
3 5.65 3.4
4 5.35 3.2
5 5.45 3.4
6 5.55 3.5
7 5.4 3.35
8 5.5 3.05
9 5.55 3.4
10 5.25 3.25
# ℹ 21 more rows
Now that we have specified the column type, readr is much quieter, and just reads in the data
If you have data in an Excel file, the readxl 📦 can be used
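As a rough sketch of the call, the example here reads a workbook that ships with readxl, because the cat-weight workbook is not reproduced in these notes. The object names xlsx_file and example_data are chosen for illustration. With your own data you would pass the path to your .xlsx file instead.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the cat-weight file)
xlsx_file <- readxl_example("datasets.xlsx")

# Read the first sheet of the workbook into a tibble
example_data <- read_xlsx(xlsx_file)
example_data
```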
# A tibble: 31 × 2
Hansi Apricot
<dbl> <dbl>
1 5.65 NA
2 5.25 3.15
3 5.65 3.4
4 5.35 3.2
5 5.45 3.4
6 5.55 3.5
7 5.4 3.35
8 5.5 3.05
9 5.55 3.4
10 5.25 3.25
# ℹ 21 more rows
We can again use col_types to tell read_xlsx() what data types to expect
But the format is different; we use "numeric" to indicate a column of numbers
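A sketch of the format, again using the example workbook bundled with readxl rather than the cat-weight file. It assumes the first sheet of that workbook has four numeric columns followed by one text column. For the two-column weights file, the equivalent would be col_types = c("numeric", "numeric").

```r
library(readxl)

# Example workbook bundled with readxl (not the cat-weight file)
xlsx_file <- readxl_example("datasets.xlsx")

# One entry per column, in order: four numeric columns, then one text column
example_data <- read_xlsx(
  xlsx_file,
  col_types = c("numeric", "numeric", "numeric", "numeric", "text")
)
example_data
```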
# A tibble: 31 × 2
Hansi Apricot
<dbl> <dbl>
1 5.65 NA
2 5.25 3.15
3 5.65 3.4
4 5.35 3.2
5 5.45 3.4
6 5.55 3.5
7 5.4 3.35
8 5.5 3.05
9 5.55 3.4
10 5.25 3.25
# ℹ 21 more rows
Even nicely arranged data, like my cats’ weight data, needs some cleaning to make it easier to work with
Usually we’ll want to clean the variable names to be consistent
We use the janitor 📦 and its clean_names() function
clean_names() turns variable names to lowercase and replaces spaces with _, among other changes
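A small sketch of clean_names() in action. The two data frames are made up for illustration: the first uses the capitalised column names from the weights file, the second a hypothetical name containing a space.

```r
library(janitor)

# Column names as they appear in the weights file
cats <- data.frame(Hansi = c(5.65, 5.25), Apricot = c(NA, 3.15))
cats <- clean_names(cats)
names(cats)

# A hypothetical column name with a space and capital letters
messy <- data.frame(`Body Weight` = c(5.65, 5.25), check.names = FALSE)
names(clean_names(messy))
```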
Many data problems originate from poor choices made at the time the data were entered into an electronic format
Follow the KISS principle: Keep it simple, stupid
Most projects involve multiple stages and data files
Having a nice, clean organisation for your files is important, e.g.,
data folder, or
raw-data and data, the latter containing processed data files
scripts or analysis folder
figures for any plots you export to disk
README.md file in your working directory
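One possible way to set up such a layout from R is sketched here. The project name my-project is hypothetical, and the project is created in a temporary folder so that the example does not touch your own files.

```r
# Create a hypothetical project folder inside a temporary directory
project <- file.path(tempdir(), "my-project")
dir.create(project, showWarnings = FALSE)

# Add the folders suggested above
for (folder in c("raw-data", "data", "scripts", "figures")) {
  dir.create(file.path(project, folder), showWarnings = FALSE)
}

# Add an empty README.md at the top level of the project
file.create(file.path(project, "README.md"))

# List what was created
list.files(project)
```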
It is very easy to end up with a spreadsheet nightmare
Source: Data Carpentry
Formatting cells to convey data is not easily readable by a computer
Source: Luis D. Verde Arregoitia
If it’s important enough to note, make it actual data
Source: Luis D. Verde Arregoitia
Merging cells might get you a nice table, but it’s hard to read into a computer
Source: Luis D. Verde Arregoitia
Instead, repeat the data so each row / column has the same number of cells
Source: Luis D. Verde Arregoitia
Any labels go in the first row; these give the variable names
Don’t use subheadings to break data up
Source: Luis D. Verde Arregoitia
Instead, store the heading as data in its own variable; we can use the variable later to filter on or group by
Source: Luis D. Verde Arregoitia
Don’t use Excel for analysis or processing of data
Use it as a data entry tool
Keep your layout simple